Swift Regular Expression Matching
Abstract
Queries involving Regular Expressions (RegEx) have a wide range of applications, including document stores, bioinformatics and information retrieval. However, efficiently executing RegEx queries over large datasets remains a challenging task. Data scans do not scale well with input size, yet existing techniques that avoid data scans (referred to as “black-box” approaches) offer little or no benefit over data scans for RegEx. Black-box approaches typically execute a RegEx query by decomposing it along its operators, computing intermediate results for the individual sub-queries (using indexes and/or partial data scans) and combining those intermediate results along the respective operators. We analyze the black-box approach and identify operators for which its execution time can be far from optimal. We then propose Swift, a set of transformations over the original RegEx that allow avoiding the black-box approach for such operators. We implement Swift over several data structures (including suffix trees, suffix arrays and compressed indexes) and show that Swift achieves significant speedups over the black-box approach and over popular open-source data stores that support RegEx via data scans, sometimes by as much as two orders of magnitude.
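The black-box execution described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the helper names (`find_literal`, `union`, `concat`, `blackbox_match`) are hypothetical, the example query `(ab|cd)ef` is chosen only for illustration, and a plain substring search stands in for the index lookup (suffix tree, suffix array, etc.) that a real system would use.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # (start, end) offsets into the text, end exclusive


def find_literal(text: str, literal: str) -> List[Span]:
    """Return every occurrence of a literal (stand-in for an index lookup)."""
    spans: List[Span] = []
    pos = text.find(literal)
    while pos != -1:
        spans.append((pos, pos + len(literal)))
        pos = text.find(literal, pos + 1)
    return spans


def union(left: List[Span], right: List[Span]) -> List[Span]:
    """Combine intermediate results for the alternation operator (A|B)."""
    return sorted(set(left) | set(right))


def concat(left: List[Span], right: List[Span]) -> List[Span]:
    """Combine intermediate results for concatenation (AB):
    keep pairs where a left match ends exactly where a right match starts."""
    ends_by_start: Dict[int, List[int]] = {}
    for start, end in right:
        ends_by_start.setdefault(start, []).append(end)
    return sorted(
        (start, right_end)
        for start, end in left
        for right_end in ends_by_start.get(end, [])
    )


def blackbox_match(text: str) -> List[Span]:
    """Evaluate the hypothetical query (ab|cd)ef operator by operator."""
    ab = find_literal(text, "ab")   # sub-query 1
    cd = find_literal(text, "cd")   # sub-query 2
    ef = find_literal(text, "ef")   # sub-query 3
    return concat(union(ab, cd), ef)


matches = blackbox_match("abefxcdefab")
```

Every sub-query materializes all of its occurrences before any combining happens, even occurrences that cannot contribute to a final match (the trailing `ab` in the sample text, for instance). That intermediate-result cost is the kind of overhead the abstract attributes to the black-box approach.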
Similar resources
Approximate Regular Expression Matching
We extend the definitions of the Hamming and Levenshtein distances between two strings, as used in approximate string matching, so that these two distances can also be used in approximate regular expression matching. Next, methods for constructing nondeterministic finite automata for approximate regular expression matching under both distances are presented.
Prefix-Free Regular-Expression Matching
We explore the regular-expression matching problem with respect to prefix-freeness of the pattern. We show that a prefix-free regular expression gives only a linear number of matching substrings in the size of a given text. Based on this observation, we propose an efficient algorithm for the prefix-free regular-expression matching problem. Furthermore, we suggest an algorithm to determine wheth...
Prefix-free regular languages and pattern matching
We explore the regular-expression matching problem with respect to prefix-freeness of the pattern. We prove that a prefix-free regular expression gives only a linear number of matching substrings in the size of a given text. Based on this observation, we propose an efficient algorithm for the prefix-free regular-expression matching problem. Furthermore, we suggest an algorithm to determine whet...
Regular Expression Matching on Graphics Hardware for Intrusion Detection
The expressive power of regular expressions has often been exploited in network intrusion detection systems, virus scanners, and spam filtering applications. However, the flexible pattern matching functionality of regular expressions in these systems comes with significant overheads in terms of both memory and CPU cycles, since every byte of the inspected input needs to be processed and compare...